Nearest Neighbor Methods


Distributionally Robust Weighted k-Nearest Neighbors

Neural Information Processing Systems

Learning a robust classifier from only a few samples remains a key challenge in machine learning. A major thrust of research has focused on developing k-nearest neighbor (k-NN) algorithms combined with metric learning to capture similarities between samples. When samples are limited, robustness is especially crucial for ensuring the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted k-nearest neighbors, which aims to find the optimal weighted k-NN classifier that hedges against feature uncertainties. We develop an algorithm, Dr.k-NN, that efficiently solves this functional optimization problem and assigns minimax-optimal weights to training samples when performing classification. These weights are class-dependent and are determined by the similarities of sample features under the least favorable scenarios. The proposed framework can be shown to be equivalent to a Lipschitz norm regularization problem.
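As a point of reference for the formulation above, the sketch below shows a plain class-dependent weighted k-NN vote in Python. The uniform weights are placeholders, not the minimax-optimal, class-dependent weights that Dr.k-NN actually computes; the function and variable names are illustrative only.

```python
# Minimal sketch of a weighted k-NN vote (NumPy). The per-sample weights here
# are uniform placeholders; Dr.k-NN would instead assign class-dependent,
# minimax-optimal weights derived from the least favorable distributions.
import numpy as np

def weighted_knn_predict(X_train, y_train, weights, x_query, k=5):
    """Classify x_query by a weighted vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                      # indices of the k nearest samples
    classes = np.unique(y_train)
    scores = {c: weights[nn][y_train[nn] == c].sum() for c in classes}
    return max(scores, key=scores.get)

# Toy usage with uniform placeholder weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w = np.ones(len(X))
print(weighted_knn_predict(X, y, w, np.array([0.5, 0.5]), k=7))
```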


The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal

Neural Information Processing Systems

We analyze the Kozachenko-Leonenko (KL) fixed k-nearest neighbor estimator for differential entropy. We obtain the first uniform upper bound on its performance for any fixed k over Hölder balls on a torus without assuming any conditions on how close the density can be to zero. Combined with a recent minimax lower bound over the Hölder ball, we show that the KL estimator for any fixed k achieves the minimax rates up to logarithmic factors without knowledge of the smoothness parameter s of the Hölder ball for s ∈ (0, 2] and arbitrary dimension d, making it the first estimator that provably satisfies this property.
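For concreteness, one common form of the fixed-k KL estimate is Ĥ = ψ(N) − ψ(k) + log V_d + (d/N) Σ_i log ρ_{i,k}, where ρ_{i,k} is the distance from X_i to its k-th nearest neighbor and V_d is the volume of the unit d-ball. The sketch below implements this form in Python; it ignores the torus geometry and boundary issues that the paper's analysis handles.

```python
# Hedged sketch of a fixed-k Kozachenko-Leonenko differential entropy estimate:
#   H_hat = psi(N) - psi(k) + log(V_d) + (d/N) * sum_i log(rho_{i,k})
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def kl_entropy(X, k=1):
    N, d = X.shape
    tree = cKDTree(X)
    # Query k+1 neighbors because the nearest neighbor of each point is itself.
    dists, _ = tree.query(X, k=k + 1)
    rho = dists[:, -1]                                     # distance to the k-th true neighbor
    log_Vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of the unit d-ball
    return digamma(N) - digamma(k) + log_Vd + d * np.mean(np.log(rho))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))          # true entropy of N(0, I_2) is log(2*pi*e) ≈ 2.838
print(kl_entropy(X, k=3))
```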


Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

Neural Information Processing Systems

Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting, and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations, and the vast majority of theoretical analyses of generalization allow for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems.
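To illustrate what a singularly weighted nearest neighbor scheme looks like, here is a minimal sketch of a k-NN regressor whose weight ||x − x_i||^(−γ) blows up at the training points, so the rule interpolates the data exactly while still averaging over noise away from them. The exponent γ and the handling of exact hits are illustrative choices, not the specific conditions analyzed in the paper.

```python
# Minimal sketch of a singularly weighted (hence interpolating) k-NN regressor.
import numpy as np

def singular_knn_regress(X_train, y_train, x_query, k=10, gamma=2.0):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]
    d = dists[nn]
    if np.any(d == 0):                      # exact hit: reproduce the training label
        return y_train[nn][d == 0].mean()
    w = d ** (-gamma)                       # singular weights diverge near training points
    return np.sum(w * y_train[nn]) / np.sum(w)
```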


Rates of Convergence for Large-scale Nearest Neighbor Classification

Neural Information Processing Systems

Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set that cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider a divide-and-conquer scheme: the entire data set is divided into small subsamples, nearest neighbor predictions are made on each subsample, and a final decision is reached by aggregating the predictions across subsamples via majority voting. We call this method the big Nearest Neighbor (bigNN) classifier and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability; these rates match those of the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time required to achieve the optimal rate, we also consider a pre-training acceleration technique applied to the bigNN method, with a proven convergence rate. We find that in the distributed setting, the optimal choice of the number of neighbors k should scale with both the total sample size and the number of partitions, and that there is a theoretical upper limit on the latter. Numerical studies verify the theoretical findings.
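A minimal sketch of the divide-and-conquer scheme described above is given below: split the data into subsamples, run k-NN on each, and aggregate the per-subsample predictions by majority vote. The values of k and the number of partitions are placeholders, not the theoretically optimal scaling derived in the paper.

```python
# Sketch of a bigNN-style prediction: k-NN on each partition, then majority vote.
import numpy as np

def knn_predict(X, y, x_query, k):
    nn = np.argsort(np.linalg.norm(X - x_query, axis=1))[:k]
    return np.bincount(y[nn]).argmax()

def bignn_predict(X, y, x_query, num_parts=4, k=5, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))                     # random partition of the data
    votes = [knn_predict(X[idx], y[idx], x_query, k)
             for idx in np.array_split(perm, num_parts)]
    return np.bincount(votes).argmax()                 # majority vote across partitions
```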


Multi-scale Consistency for Robust 3D Registration via Hierarchical Sinkhorn Tree

Neural Information Processing Systems

We study the problem of retrieving accurate correspondences through multi-scale consistency (MSC) for robust point cloud registration. Existing works that proceed in a coarse-to-fine manner either suffer from severely noisy correspondences caused by unreliable coarse matching or struggle to form outlier-free coarse-level correspondence sets. To tackle this, we present the Hierarchical Sinkhorn Tree (HST), a pruned tree structure designed to hierarchically measure the local consistency of each coarse correspondence across multiple feature scales, thereby filtering out the locally dissimilar ones. In this way, we convert the modeling of MSC for each correspondence into a BFS traversal with pruning of a K-ary tree rooted at the superpoint, with its K nearest neighbors in the feature pyramid serving as child nodes. To achieve efficient pruning and accurate vicinity characterization, we further propose a novel overlap-aware Sinkhorn Distance, which retains only the most likely overlapping points for local measurement and next-level exploration. The modeling process essentially involves traversing a pair of HSTs synchronously and aggregating the consistency measures of corresponding tree nodes. Extensive experiments demonstrate that HST consistently outperforms state-of-the-art methods on both indoor and outdoor benchmarks.
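As background for the distance used inside HST, here is a minimal sketch of plain entropy-regularized Sinkhorn iterations between two local point sets. The overlap-aware selection of likely overlapping points is the paper's contribution and is not reproduced here; the regularization strength and iteration count are arbitrary illustrative values.

```python
# Hedged sketch of entropy-regularized Sinkhorn iterations between two point sets.
import numpy as np

def sinkhorn_distance(A, B, eps=0.1, n_iters=100):
    """Entropic OT cost between uniform measures on point sets A and B."""
    C = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) ** 2  # pairwise cost matrix
    C = C / max(C.max(), 1e-12)                                      # normalize for stability
    K = np.exp(-C / eps)
    a = np.full(len(A), 1.0 / len(A))
    b = np.full(len(B), 1.0 / len(B))
    u = np.ones_like(a)
    for _ in range(n_iters):                                         # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                                  # transport plan
    return np.sum(P * C)
```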


An adaptive nearest neighbor rule for classification

Neural Information Processing Systems

We introduce a variant of the k-nearest neighbor classifier in which k is chosen adaptively for each query rather than supplied as a parameter. The choice of k depends on properties of each neighborhood and may therefore vary significantly between points; for example, the algorithm will use a larger k when predicting the labels of points in noisy regions. We provide theory and experiments demonstrating that the algorithm performs comparably to, and sometimes better than, k-NN with an optimal choice of k. In particular, we bound the convergence rate of our classifier in terms of a local quantity we call the "advantage", giving results that are both more general and more accurate than the smoothness-based bounds of earlier nearest neighbor work. Our analysis uses a variant of the uniform convergence theorem of Vapnik and Chervonenkis for empirical estimates of conditional probabilities, which may be of independent interest.
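One simple way to instantiate a query-dependent k is sketched below: grow k until the empirical vote margin around 1/2 clears a confidence band of order 1/√k, so noisy regions automatically end up with larger k. This is a hedged illustration of the idea, not the paper's "advantage"-based rule; the constant c and the fallback are arbitrary choices.

```python
# Sketch of an adaptive-k nearest neighbor rule for binary labels y in {0, 1}.
import numpy as np

def adaptive_knn_predict(X, y, x_query, k_max=None, c=1.0):
    order = np.argsort(np.linalg.norm(X - x_query, axis=1))  # neighbors by distance
    k_max = k_max or len(X)
    for k in range(1, k_max + 1):
        p_hat = y[order[:k]].mean()
        if abs(p_hat - 0.5) > c / np.sqrt(k):   # margin beats the noise band: stop
            return int(p_hat > 0.5)
    return int(y[order[:k_max]].mean() > 0.5)   # fall back to the largest k
```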


A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice

Neural Information Processing Systems

In the k-nearest neighbor model (k-NN), we are given a set of points P and answer a query q by returning the k nearest neighbors of q in P according to some metric. This concept is crucial in many areas of data analysis and data processing, e.g., computer vision, document retrieval, and machine learning. Many k-NN algorithms have been published and implemented, but the relation between their parameters and the accuracy of the computed k-NN is often not made explicit.
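For reference, the sketch below answers such a query exactly with a k-d tree and defines a simple recall measure against which an approximate k-NN answer can be scored; the recall definition and names are illustrative, not the evaluation methodology of the paper.

```python
# Exact k-NN query as a reference answer, plus a simple recall@k score.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 8))       # point set P
q = rng.normal(size=8)                 # query q
k = 10

tree = cKDTree(P)
_, exact = tree.query(q, k=k)          # indices of the true k nearest neighbors of q

def recall_at_k(approx_indices, exact_indices):
    """Fraction of the true k nearest neighbors returned by an approximate method."""
    return len(set(approx_indices) & set(exact_indices)) / len(exact_indices)
```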


One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

Neural Information Processing Systems

Transformers have achieved great success in recent years. Interestingly, transformers have shown a particularly strong in-context learning capability: even without fine-tuning, they are able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers to learn one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework in which the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex, a single softmax attention layer trained with gradient descent can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models.
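The connection can be seen in a toy form: softmax attention whose logits are negative scaled squared distances between the test input and the in-context examples approaches a one-nearest neighbor prediction as the inverse temperature grows. The sketch below is a hedged illustration of that mechanism only, not the paper's construction or training procedure.

```python
# Softmax attention over context examples behaving like 1-NN for large beta.
import numpy as np

def softmax_attention_predict(X_ctx, y_ctx, x_test, beta=50.0):
    logits = -beta * np.sum((X_ctx - x_test) ** 2, axis=1)  # distance-based attention logits
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                                      # softmax over context examples
    return float(attn @ y_ctx)                              # ≈ label of the nearest example

rng = np.random.default_rng(0)
X_ctx = rng.normal(size=(20, 4))
y_ctx = rng.integers(0, 2, size=20).astype(float)
x_test = X_ctx[7] + 0.01 * rng.normal(size=4)
print(softmax_attention_predict(X_ctx, y_ctx, x_test))      # close to y_ctx[7]
```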


k-NN as a Simple and Effective Estimator of Transferability

arXiv.org Artificial Intelligence

How well can one expect transfer learning to work in a new setting where the domain is shifted, the task is different, and the architecture changes? Many transfer learning metrics have been proposed to answer this question. But how accurate are their predictions in a realistic new setting? We conducted an extensive evaluation involving over 42,000 experiments comparing 23 transferability metrics across 16 different datasets to assess their ability to predict transfer performance. Our findings reveal that none of the existing metrics perform well across the board. However, we find that a simple k-nearest neighbor evaluation -- as is commonly used to evaluate feature quality for self-supervision -- not only surpasses existing metrics, but also offers better computational efficiency and ease of implementation.
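A minimal sketch of the kind of k-NN evaluation referred to above: extract features from the pretrained model on the target data, then report cross-validated k-NN accuracy in that feature space as the transferability score. The function name, k, and fold count are illustrative assumptions, not the exact protocol of the study.

```python
# k-NN accuracy on extracted features as a simple transferability score.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_transferability_score(features, labels, k=20, folds=5):
    """Mean cross-validated k-NN accuracy in the pretrained feature space."""
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, features, labels, cv=folds).mean()

# Usage: features = encoder(target_images); score = knn_transferability_score(features, labels)
```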


TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Neural Information Processing Systems

This paper presents a novel nearest neighbor search algorithm that achieves peak TPU (Google Tensor Processing Unit) performance, outperforming state-of-the-art GPU algorithms at a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both memory and instruction bottlenecks. Our algorithm comes with an analytical guarantee of recall in expectation and does not require maintaining a sophisticated index data structure or tuning, making it suitable for applications with frequent updates. Our work is available in the open-source packages of JAX and TensorFlow on TPU.
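The accelerator-friendly idea behind such search can be sketched simply: score all candidates with one dense matrix multiplication and then take a top-k, keeping the work in the regime where the matrix units, rather than memory, are the bottleneck. The code below is plain NumPy for illustration only and is not the TPU-KNN implementation or its recall-controlled approximation.

```python
# Brute-force maximum inner product search via one matmul plus a partial sort.
import numpy as np

def matmul_topk(queries, database, k=10):
    scores = queries @ database.T                        # (num_queries, num_points) in one matmul
    idx = np.argpartition(-scores, k, axis=1)[:, :k]     # unordered top-k indices per query
    order = np.argsort(-np.take_along_axis(scores, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)        # top-k indices sorted by score
```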